Yunxiao Wang

Exploratory Data Analysis On The Red Wine Quality Dataset

About the dataset

This dataset contains 1,599 red wines with 11 input features describing the chemical properties of each wine. The output, quality, is based on at least 3 evaluations made by wine experts and is rated on a scale of 0 (very bad) to 10 (very excellent).

My main goal in this analysis is to understand how chemical features affect the quality of wine and to be able to predict the subjective quality of wine from its objective properties. However, I will also look at other interesting relationships as I dig deeper into the dataset.

Univariate Plots Section

## [1] 1599   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## Classes 'tbl_df', 'tbl' and 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Quality

The median quality for red wines is 6.0 and the mean quality is 5.636, which is lower than the median. The minimum quality is 3.0 and the maximum is 8.0.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## 
## FALSE 
##  1599
## [1] 280
## [1] 0.1751094
## [1] 0.8248906

Quality is mostly between 5 and 7 and relatively symmetric, which is consistent with the median and mean. All qualities are integers. About 82% of wines are rated either 5 or 6, which means it probably won’t be very easy to predict wine quality, because the majority of the provided data have almost identical ratings.
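The share of wines rated 5 or 6 follows directly from the counts printed above. Here is a minimal sketch in R, assuming the printed 280 is the number of wines rated something other than 5 or 6 (this is consistent with the two proportions in the output). The variable names are illustrative and not taken from the original analysis.

```r
# Counts copied from the output above. 280 is assumed to be the number
# of wines whose rating is neither 5 nor 6.
n_total <- 1599
n_not_5_or_6 <- 280

share_not_5_or_6 <- n_not_5_or_6 / n_total
share_5_or_6 <- 1 - share_not_5_or_6

round(c(other = share_not_5_or_6, five_or_six = share_5_or_6), 4)
```

With so little spread in the response, a model that always predicts 5 or 6 is already hard to beat.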


Fixed Acidity

## 7.2 
##  67
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The fixed acidity levels are roughly centered around 7.5 g/dm^3, but the right tail is a little longer than the left. The mode of fixed acidity is 7.2 g/dm^3, the median is 7.9 g/dm^3, and the mean is 8.32 g/dm^3.


Volatile Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Most volatile acidity levels are between 0.3 g/dm^3 and 0.7 g/dm^3. Median is 0.52 g/dm^3 and mean is 0.5278 g/dm^3.


Citric Acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
## 
## FALSE  TRUE 
##  1467   132
##   0 
## 132

The distribution of citric acid levels seems a little random with a few peaks. It’s worth noting the mode of citric acid is actually zero. Since citric acid can add freshness and flavor to wines, I wonder if these wines have low quality.
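One common way to find a mode in R is to tabulate the values and pick the most frequent one. The sketch below uses only the first ten citric acid values shown in the str() output near the top of the report, so it illustrates the technique rather than reproducing the full-data result. The variable names are hypothetical.

```r
# First ten citric acid values, copied from the str() output above.
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Tabulate the values and take the most frequent one as the mode.
counts <- table(citric)
mode_value <- as.numeric(names(counts)[which.max(counts)])
mode_count <- max(counts)

c(mode = mode_value, count = mode_count)
```

Even in this small subset the most frequent value is 0, in line with what the full dataset shows.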

It turns out the quality distribution is not that different from that of the whole sample, which means other variables outweigh the citric acid level in the cases where wines have 0 citric acid.


Residual Sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Most wines have residual sugar between 1.5 and 3 g/dm^3. The median is 2.2 g/dm^3 and the mean is 2.539 g/dm^3. Some wines have higher sugar levels; the highest residual sugar amount is 15.5 g/dm^3, which is still a lot lower than the threshold for what’s considered sweet (45 g/dm^3).


Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Most wines have chlorides between 0.05 g/dm^3 and 0.1 g/dm^3. Median is 0.079 g/dm^3 and mean is 0.08747 g/dm^3. The chlorides of this sample go all the way up to 0.611 g/dm^3.


Free Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Before transformation, the distribution of free sulfur dioxide looks long tailed. After transforming the data by taking log10 to better understand the distribution, I did not gain much new insight. The distribution peaks around 6 mg/dm^3. The median is 14.00 mg/dm^3 and mean is 15.87 mg/dm^3.


Total Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00
## 
## FALSE  TRUE 
##  1597     2
## [1] 278 289

The distribution of total sulfur dioxide is again long-tailed, peaking at about 15 mg/dm^3. I did not observe any interesting pattern after transforming the x variable with log10. There are two outliers, one with a total sulfur dioxide level of 278 mg/dm^3 and the other with 289 mg/dm^3.


Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040
## [1] 10
## [1] 71
## [1] 1518

The density distribution is centered around 0.9968 g/cm^3 and seems relatively symmetric. It’s likely to be highly correlated with alcohol, sugar and other features. The median is 0.9968 g/cm^3 and the mean is 0.9967 g/cm^3. For all wines, density remains very close to 1 g/cm^3 (the density of water), with a minimum of 0.9901 g/cm^3. 1518 of the 1599 wines have a density less than 1 g/cm^3, 10 wines have exactly 1, and 71 have a density larger than 1. Since alcohol is less dense than pure water, wines that are heavier than water must contain a significant amount of sugar and other chemicals (relative to alcohol) to bring the density up. I will compare the sugar to alcohol ratios in wines with different densities.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07087 0.18180 0.20950 0.23620 0.24750 1.32400
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2111  0.2615  0.3009  0.4450  0.3784  1.7110

Here are the density plots of sugar to alcohol ratios. The peak for heavier wines is to the right of the peak for lighter wines, which is to be expected. The median ratio of the heavier wines is about 0.09 larger than that of the lighter wines, and the mean is about 0.21 larger.
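The sugar to alcohol ratio and the density grouping can be sketched as follows. This is an illustration under assumptions: it uses only the first four wines from the str() output near the top of the report, the derived column names are hypothetical, and how wines with a density of exactly 1 were grouped in the original analysis is not stated.

```r
# First four wines, copied from the str() output near the top of the report.
wine_head <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9),
  alcohol        = c(9.4, 9.8, 9.8, 9.8),
  density        = c(0.998, 0.997, 0.997, 0.998)
)

# Hypothetical names for the derived variables.
wine_head$sugar.alcohol.ratio <- wine_head$residual.sugar / wine_head$alcohol
wine_head$density.group <- ifelse(wine_head$density > 1, "heavier", "not heavier")

# Summarise the ratio within each density group.
tapply(wine_head$sugar.alcohol.ratio, wine_head$density.group, summary)
```

All four of these wines are lighter than water, so only one group appears here. On the full dataset the same call would produce one summary per group.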


pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The pH values of most wines fall between 3 and 3.7. The maximum pH is 4.01, so all wines are acidic. The median is 3.31 and the mean is 3.311. The pH value is very likely to be highly correlated with the acidity features.


Sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000
## 
## FALSE  TRUE 
##  1591     8

Most wines have sulphates between 0.5 and 0.8 g/dm^3. Only 8 wines have more than 1.5 g/dm^3 sulphates. The median is 0.62 g/dm^3 and the mean is 0.6581 g/dm^3. The sulphates level also contributes to sulfur dioxide levels.


Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
## 9.5 
## 139
## 
## FALSE  TRUE 
##  1598     1

The distribution of alcohol levels peaks at 9.5% (the mode of alcohol). The right tail is significantly longer than the left, extending all the way to 14.9%, which is also the only value larger than 14%. The median is 10.2% and the mean is 10.42%.


Univariate Analysis

What is the structure of your dataset?

There are 1599 wines with 13 features. The first one, “X”, is simply the index, leaving only 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality). All features are numeric; quality is the only one stored as an integer.

Other observations:

  • The median quality is 6.0, the minimum quality is 3.0 and the maximum quality is 8.0.
  • The mode of citric acid level is 0.
  • Only two wines have more than 270 mg/dm^3 total sulfur dioxide, while all the remaining wines are below 170 mg/dm^3.
  • 10 wines have a density of exactly 1 g/cm^3, and 71 wines have a density larger than 1 g/cm^3. Wines heavier than water have a larger mean sugar to alcohol ratio.

What is/are the main feature(s) of interest in your dataset?

Quality is the main feature. I’d like to find out which features can be used to predict the quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, total sulfur dioxide and alcohol are all likely to have an effect on the quality of wines.

Did you create any new variables from existing variables in the dataset?

A sugar to alcohol ratio was added to better understand the difference between wines heavier than water and lighter wines.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

There are a few things I’ve noticed:

  • The mode of citric acid level is 0; 132 out of 1599 wines have 0 citric acid. Since citric acid can add to the “freshness” and flavor of wine, I was expecting wines with 0 citric acid to have a lower quality rating, but the quality distribution of those wines is very similar to that of the whole dataset.
  • There are two wines with more than 270 mg/dm^3 total sulfur dioxide, which is much higher than the rest of the dataset.
  • I was expecting wine to be lighter than water, but 10 wines have the same density as water, and 71 wines are heavier than water. The relatively large sugar to alcohol ratios in these wines can at least partially explain the density.

Bivariate Plots Section

Originally I thought residual sugar is also an important feature in determining quality, but now it seems that’s not the case.


pH vs. volatile acidity(acetic acid)

First I’ll explore pairs of features with relatively high correlation coefficients. There are 2 pairs that surprised me the most.

## 
##  Pearson's product-moment correlation
## 
## data:  volatile.acidity and pH
## t = 9.659, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1880823 0.2807254
## sample estimates:
##       cor 
## 0.2349373
## 
##  Pearson's product-moment correlation
## 
## data:  volatile.acidity and pH
## t = 9.659, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  0.1731696 0.2948641
## sample estimates:
##       cor 
## 0.2349373

I was expecting a negative coefficient, but it was actually 0.23. Even the 99% confidence interval doesn’t cross 0. However, after making the scatter plot, it looks like the distribution is somewhat random.
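The numbers in the test output above can be reproduced from the correlation coefficient and the sample size alone. The sketch below uses the standard t statistic for a Pearson correlation and confidence intervals based on the Fisher z-transformation. The variable names are illustrative.

```r
r <- 0.2349373   # sample correlation reported above
n <- 1599        # number of wines

# t statistic for H0: true correlation = 0, with n - 2 degrees of freedom
df <- n - 2
t_stat <- r * sqrt(df / (1 - r^2))
p_value <- 2 * pt(-abs(t_stat), df)

# Confidence intervals via the Fisher z-transformation
z <- atanh(r)
se <- 1 / sqrt(n - 3)
ci_95 <- tanh(z + c(-1, 1) * qnorm(0.975) * se)
ci_99 <- tanh(z + c(-1, 1) * qnorm(0.995) * se)

list(t = t_stat, p = p_value, ci_95 = ci_95, ci_99 = ci_99)
```

With almost 1600 observations the standard error is small, so even a weak correlation of 0.23 is clearly distinguishable from 0. Statistical significance here says little about how strong the relationship is.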

After reading Wikipedia, I found out that acetic acid (volatile acid) at 1.0 molar concentration has a pH of 2.4, while citric acid at the same concentration has a pH of 1.57. The molar mass of citric acid is also more than 3 times that of acetic acid, so at the same mass concentration citric acid contributes fewer molecules; even so, acetic acid is the weaker acid and I’d still expect it to give the higher pH of the two. In an extreme case, if we were to add acetic acid to pure citric acid, I’d expect the pH of the mixed acid might increase. In reality, the acetic acid is in wine, not in pure citric acid, and the coefficient between pH and acetic acid is not high, but it now makes some sense to me how it can be positive. Of course, correlation does not imply causation; maybe the real reason is that other features, included or not included in the dataset, caused the change in pH and happened to coincide with the change in acetic acid content.
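The comparison of the two acids can be checked with the textbook weak-acid approximation, pH ≈ 0.5 * (pKa - log10(C)). This is a rough sketch under stated assumptions: the pKa values and molar masses below are textbook values and not part of the dataset, only the first dissociation of citric acid is considered, and the approximation loses accuracy at low concentrations.

```r
# Weak-acid approximation: [H+] ~ sqrt(Ka * C), so pH ~ 0.5 * (pKa - log10(C)).
weak_acid_pH <- function(pKa, molar_conc) 0.5 * (pKa - log10(molar_conc))

pKa_acetic <- 4.76           # assumed textbook value
pKa_citric <- 3.13           # assumed textbook value, first dissociation only
molar_mass_acetic <- 60.05   # g/mol, assumed textbook value
molar_mass_citric <- 192.12  # g/mol, assumed textbook value

# Equal molar concentration (1 mol/dm^3)
pH_molar <- c(acetic = weak_acid_pH(pKa_acetic, 1),
              citric = weak_acid_pH(pKa_citric, 1))

# Equal mass concentration (1 g/dm^3)
pH_mass <- c(acetic = weak_acid_pH(pKa_acetic, 1 / molar_mass_acetic),
             citric = weak_acid_pH(pKa_citric, 1 / molar_mass_citric))

rbind(pH_molar, pH_mass)
```

At 1 mol/dm^3 the approximation lands close to the values quoted above. At equal mass concentration acetic acid still has the higher pH, although in this approximation the gap between the two acids is smaller than at equal molar concentration.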


citric acid vs. volatile acidity(acetic acid)

## 
##  Pearson's product-moment correlation
## 
## data:  volatile.acidity and citric.acid
## t = -26.4891, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5856550 -0.5174902
## sample estimates:
##        cor 
## -0.5524957

At first, I was really surprised by the relatively large negative correlation coefficient between citric acid and acetic acid, since I thought they were mostly independent of each other. But after searching online, I learned that during fermentation citric acid has a tendency to be converted into acetic acid, which can potentially explain the negative correlation coefficient: more volatile acid just means more citric acid has been converted into acetic acid.


quality vs. alcohol

Out of all the features, alcohol is the one with the highest correlation coefficient with quality. Next I will look at the scatter plot of quality vs. alcohol.

The vertical strips indicate that all quality ratings take integer values. Overall, the quality increases with more alcohol. The red line is the median at each quality rating. The blue line is a linear fit. Next, I’ll look at other features that contribute significantly to quality.


quality vs. fixed acidity

## 
##  Pearson's product-moment correlation
## 
## data:  fixed.acidity and quality
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516

The red line here is a linear fit. The quality slightly increases as fixed acidity increases.


quality vs. volatile acidity


## 
##  Pearson's product-moment correlation
## 
## data:  volatile.acidity and quality
## t = -16.9542, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Too much volatile acid leads to an unpleasant, vinegar taste, so wines with higher volatile acidity tend to receive a lower quality rating.


quality vs. citric acid

## 
##  Pearson's product-moment correlation
## 
## data:  wine$citric.acid and wine$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725

The overall trend is that quality slightly increases with more citric acid in the wine, as citric acid adds flavor to the wine.


quality vs. chlorides

## 
##  Pearson's product-moment correlation
## 
## data:  wine$chlorides and wine$quality
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17681041 -0.08039344
## sample estimates:
##        cor 
## -0.1289066

Although the trend looks weak on the plot, the statistical analysis indicates a negative coefficient. It would be interesting to see how much the chlorides feature can improve the prediction model when I perform the multivariate analysis.


quality vs. total sulfur dioxide

## 
##  Pearson's product-moment correlation
## 
## data:  wine$total.sulfur.dioxide and wine$quality
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2320162 -0.1373252
## sample estimates:
##        cor 
## -0.1851003

Total sulfur dioxide seems to affect quality in a negative way.


quality vs. sulphates

## 
##  Pearson's product-moment correlation
## 
## data:  wine$sulphates and wine$quality
## t = 10.3798, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971

Sulphates seem to enhance the quality of wines, which makes sense because they act as antimicrobial and antioxidant agents that protect the quality of wines.


quality vs. residual sugar

## 
##  Pearson's product-moment correlation
## 
## data:  wine$residual.sugar and wine$quality
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03531327  0.06271056
## sample estimates:
##        cor 
## 0.01373164

I expected sugar to be a feature that contributes to quality as well, but the correlation coefficient is close to zero and its 95% confidence interval includes 0. Maybe I have to look at this relationship again in the multivariate analysis.


Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

None of the relationships look very linear. Some of them are relatively easy to spot when looking at their scatter plots. Others require statistical analysis to identify. Part of the reason that the relationships are generally not easy to see is that quality ratings are all integers, and 82% of wines have a quality rating of either 5 or 6.

  • The quality slightly increases with higher fixed acidity. It’s hard for me to see that on the scatter plot, but statistical analysis yields a positive correlation coefficient with more than 99% confidence.

  • The quality drops with more volatile acid. This is not as clear in the low volatile acidity range, but it becomes more obvious in the higher range, since too much volatile acid creates an unpleasant taste.

  • Citric acid is usually used to improve the flavor of wines. This relationship is again not so clear in the lower citric acid range, but becomes somewhat clearer in the higher range.

  • Quality seems to decrease with more chlorides in the wines. This is not very clear on the scatter plot either.

  • Quality slightly drops with more total sulfur dioxide as well. This is relatively obvious on the scatter plot for wines with a quality of 5 or higher.

  • I was a little surprised that the correlation coefficient for sulphates and quality is slightly higher than that of citric acid and quality, since citric acid enhances flavor while sulphates are only there as an antimicrobial and antioxidant. I guess maybe the quality drops significantly without proper protection from the addition of sulphates. But if we focus on the range with sulphates > 1.0, it seems too much sulphates actually reduces the quality of wines. It’s just that most of the wines fall into the range where this is not the case.

  • Alcohol seems to be the most important feature that affects quality here. Generally, the quality of wines increases as the alcohol level increases.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes.

I was very surprised to see that pH slightly increases as volatile acidity (acetic acid) increases. At first, I thought this must be a somewhat random result for our particular dataset. But after doing some research, I realized acetic acid has a higher pH than citric acid at the same molar concentration. The molar mass of citric acid is much higher than that of acetic acid, but since acetic acid is the weaker acid I’d still expect it to have the higher pH at the same mass concentration. When added to a relatively acidic environment, acetic acid can probably increase pH. Or the real reason may be that other features that increase pH happen to coincide with a higher acetic acid level.

The other relationship that surprised me was that citric acid decreases as acetic acid increases. I was expecting them to be independent of each other. After searching online, I found out citric acid tends to be converted into acetic acid during fermentation, which might be the reason for this odd relationship.

What was the strongest relationship you found?

The strongest relationship is the one between fixed acidity and citric acid, but that’s just because a large portion of fixed acid is citric acid. So maybe I should only use citric acid when predicting quality, since the two overlap too much. The feature that affects quality the most is alcohol, with a correlation coefficient of 0.476.

Multivariate Plots Section

Again, the main goal is to understand the important determining factors of quality, in other words, to describe the subjective variable quality as a function of different objective, measurable features. Since the correlation coefficient between alcohol and quality is the highest among all input features, I will mostly use alcohol content as the x variable in the following plots while using another input variable as the color. If the correlation between the second input variable and quality is significant enough, I should be able to observe a pattern in how color changes while holding alcohol content constant.

quality vs. alcohol and residual sugar

Colors in this plot are so close because of the few wines with extremely high residual sugar levels. I will make another plot without the high sugar levels.

It still doesn’t look like sugar plays an important role in determining quality.


quality vs. alcohol and volatile acidity

Quality is overall higher for lower volatile acidity.


quality vs. alcohol and citric acid

Quality is higher for higher citric acid level.


quality vs. alcohol and chlorides

I need to narrow down the range of chlorides content so colors are not so similar.

The pattern is not as clear for chlorides. I will still add chlorides to my prediction model and see how much difference it makes.


quality vs. alcohol and total sulfur dioxide

Again, I need to narrow down the range.

The pattern is not clear either, but there seems to be relatively more points with more sulfur dioxide for lower quality.


quality vs. alcohol and sulphates

Quality is higher for higher sulphates level.


Linear model for predicting quality

## 
## Calls:
## qual.m1: lm(formula = quality ~ alcohol, data = wine)
## qual.m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wine)
## qual.m3: lm(formula = quality ~ alcohol + volatile.acidity + citric.acid, 
##     data = wine)
## qual.m4: lm(formula = quality ~ alcohol + volatile.acidity + citric.acid + 
##     total.sulfur.dioxide, data = wine)
## qual.m5: lm(formula = quality ~ alcohol + volatile.acidity + citric.acid + 
##     total.sulfur.dioxide + sulphates, data = wine)
## qual.m6: lm(formula = quality ~ alcohol + volatile.acidity + citric.acid + 
##     total.sulfur.dioxide + sulphates + chlorides, data = wine)
## 
## =================================================================================
##                        qual.m1   qual.m2   qual.m3   qual.m4   qual.m5   qual.m6 
## ---------------------------------------------------------------------------------
## (Intercept)            1.875***  3.095***  3.055***  3.248***  2.843***  2.985***
##                       (0.175)   (0.184)   (0.194)   (0.200)   (0.205)   (0.206)  
## alcohol                0.361***  0.314***  0.314***  0.302***  0.295***  0.276***
##                       (0.017)   (0.016)   (0.016)   (0.016)   (0.016)   (0.017)  
## volatile.acidity                -1.384*** -1.343*** -1.307*** -1.222*** -1.104***
##                                 (0.095)   (0.114)   (0.114)   (0.112)   (0.115)  
## citric.acid                                0.068     0.106    -0.043     0.065   
##                                           (0.103)   (0.103)   (0.104)   (0.106)  
## total.sulfur.dioxide                                -0.002*** -0.002*** -0.002***
##                                                     (0.001)   (0.001)   (0.001)  
## sulphates                                                      0.721***  0.908***
##                                                               (0.103)   (0.111)  
## chlorides                                                               -1.763***
##                                                                         (0.403)  
## ---------------------------------------------------------------------------------
## R-squared                 0.227     0.317     0.317     0.324     0.344     0.352
## adj. R-squared            0.226     0.316     0.316     0.322     0.342     0.349
## sigma                     0.710     0.668     0.668     0.665     0.655     0.651
## F                       468.267   370.379   246.976   190.618   166.962   143.910
## p                         0.000     0.000     0.000     0.000     0.000     0.000
## Log-likelihood        -1721.057 -1621.814 -1621.596 -1614.095 -1589.749 -1580.192
## Deviance                805.870   711.796   711.603   704.957   683.814   675.689
## AIC                    3448.114  3251.628  3253.192  3240.189  3193.499  3176.384
## BIC                    3464.245  3273.136  3280.078  3272.452  3231.138  3219.401
## N                      1599      1599      1599      1599      1599      1599    
## =================================================================================

As I expected, the linear model here doesn’t work that well. With a linear model, the features I selected explain only 35% of the variance in quality.
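To make the fitted model concrete, here is a worked prediction for the first wine in the dataset using the qual.m6 coefficients from the table. It is a sketch only: the coefficients are the rounded values printed above (the total sulfur dioxide coefficient in particular is rounded to a single significant digit), so the result differs slightly from what the fitted model object would return. The variable names are hypothetical.

```r
# Coefficients of qual.m6, copied from the table above (rounded to 3 decimals).
coef_m6 <- c(intercept = 2.985, alcohol = 0.276, volatile.acidity = -1.104,
             citric.acid = 0.065, total.sulfur.dioxide = -0.002,
             sulphates = 0.908, chlorides = -1.763)

# First wine in the dataset, copied from the str() output; its rating is 5.
wine_1 <- c(intercept = 1, alcohol = 9.4, volatile.acidity = 0.7,
            citric.acid = 0, total.sulfur.dioxide = 34,
            sulphates = 0.56, chlorides = 0.076)

# A linear model's prediction is the sum of coefficient * feature value.
predicted_quality <- sum(coef_m6 * wine_1[names(coef_m6)])
predicted_quality
```

The prediction comes out close to the observed rating of 5 for this wine. Since most wines are rated 5 or 6, being close on a typical wine is not strong evidence that the model separates good wines from bad ones.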


Now I will look at density.

density vs. fixed acidity and residual sugar

Density of wines is higher for higher fixed acidity level and more residual sugar, which makes a lot of sense.

density vs. fixed acidity and alcohol

The relationship between density and alcohol is even more obvious than that between density and sugar.

Linear model for density

## 
## Calls:
## den.m1: lm(formula = density ~ fixed.acidity, data = wine)
## den.m2: lm(formula = density ~ fixed.acidity + alcohol, data = wine)
## den.m3: lm(formula = density ~ fixed.acidity + alcohol + residual.sugar, 
##     data = wine)
## 
## ================================================
##                   den.m1     den.m2     den.m3  
## ------------------------------------------------
## (Intercept)      0.991***   0.999***   0.999*** 
##                 (0.000)    (0.000)    (0.000)   
## fixed.acidity    0.001***   0.001***   0.001*** 
##                 (0.000)    (0.000)    (0.000)   
## alcohol                    -0.001***  -0.001*** 
##                            (0.000)    (0.000)   
## residual.sugar                         0.000*** 
##                                       (0.000)   
## ------------------------------------------------
## R-squared            0.446      0.654      0.746
## adj. R-squared       0.446      0.654      0.746
## sigma                0.001      0.001      0.001
## F                 1287.167   1508.935   1562.809
## p                    0.000      0.000      0.000
## Log-likelihood    8234.081   8610.211   8857.637
## Deviance             0.003      0.002      0.001
## AIC             -16462.161 -17212.423 -17705.274
## BIC             -16446.030 -17190.914 -17678.388
## N                 1599       1599       1599    
## ================================================

With all three features I selected and a linear model, 74.6% of the variance in density is explained.


Next I’ll analyze the positive correlation coefficient between pH and volatile acidity.

Since wines tend to contain less volatile acid if they have more citric acid, it’s possible that citric acid is simply the more dominant factor: pH is lower with more citric acid, and more citric acid usually means less volatile acid, which would result in the positive correlation coefficient. If this is the main reason, then holding citric acid constant, I would expect pH to still be lower with more volatile acid. The next plot will tell if that’s really the case.

pH vs. citric acid and volatile acidity

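First, I’d like to note that I used exp(-pH) because pH is the negative logarithm of the activity of hydrogen ions, which is really “the true acidity”. Strictly, pH uses a base-10 logarithm, so the activity itself is 10^(-pH); exp(-pH) is a monotonic transform of it that still increases with acidity.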
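The relationship between exp(-pH) and the hydrogen ion activity 10^(-pH) can be checked numerically. The sketch below uses the minimum, median and maximum pH from the summary earlier in the report. The variable names are illustrative.

```r
# Minimum, median and maximum pH, copied from the summary above.
pH_values <- c(min = 2.74, median = 3.31, max = 4.01)

h_activity <- 10^(-pH_values)   # activity implied by the definition of pH
exp_neg_pH <- exp(-pH_values)   # the transform used as the response here

# exp(-pH) equals the activity raised to the power log10(e), about 0.434,
# so it rises with acidity but is not proportional to the activity.
all.equal(exp_neg_pH, h_activity^log10(exp(1)))

rbind(h_activity, exp_neg_pH)
```

Because the transform is monotonic, the sign of a fitted coefficient has the same interpretation either way: a positive coefficient means more acidity. The size of the coefficients, however, would differ if 10^(-pH) were used as the response.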

Given the same citric acid level, it’s not clear to me whether pH tends to be higher or lower with more volatile acid. This probably means the above guess is not the main reason for the positive correlation coefficient. Thus my previous analysis may still be true: in a generally acidic environment, adding a small amount of volatile acid can in fact increase pH. Because volatile acid content tends to decrease with higher citric acid content, I want to perform a linear fit with the two features accounted for separately.

## 
## Calls:
## pH.m1: lm(formula = exp(-pH) ~ citric.acid, data = wine)
## pH.m2: lm(formula = exp(-pH) ~ citric.acid + volatile.acidity, data = wine)
## 
## =======================================
##                     pH.m1      pH.m2   
## ---------------------------------------
## (Intercept)        0.033***   0.030*** 
##                   (0.000)    (0.001)   
## citric.acid        0.016***   0.018*** 
##                   (0.001)    (0.001)   
## volatile.acidity              0.003*** 
##                              (0.001)   
## ---------------------------------------
## R-squared              0.297      0.306
## adj. R-squared         0.297      0.305
## sigma                  0.005      0.005
## F                    675.870    351.099
## p                      0.000      0.000
## Log-likelihood      6283.552   6292.912
## Deviance               0.036      0.036
## AIC               -12561.103 -12577.824
## BIC               -12544.972 -12556.316
## N                   1599       1599    
## =======================================

Here the linear model actually shows a positive coefficient for volatile acidity. My y variable is exp(-pH), so this means volatile acidity affects pH in the same direction as citric acid, in the sense that they both contribute to acidity. But the contribution from volatile acid is much smaller than that from citric acid: only about 1/6 in terms of the magnitude of the coefficients. Since the coefficient is so small, the effect of volatile acidity is not nearly as important as that of citric acid.


Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The most important feature that contributes to quality is alcohol. The other relatively important features are:

  • volatile acidity
  • sulphates
  • citric acid
  • to a lesser extent, total sulfur dioxide and chlorides

But a linear model does not work very well in predicting quality with the features at hand.

Were there any interesting or surprising interactions between features?

Density is closely related to fixed acidity, alcohol and residual sugar. Wines are denser with more fixed acid and sugar, and with less alcohol.

At first glance, the correlation coefficient between pH and volatile acid is positive, which seemed a little counterintuitive to me. After separating out the effect of citric acid, volatile acid also seems to reduce pH, but to a much lesser extent than citric acid.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I used linear models to predict quality and to study density and pH. However, the linear model for quality was not very good, since quality does not depend on the features in a very linear way. It’s a very subjective feature obtained from a small number of evaluations made by wine experts, so it’s likely to have a lot of randomness.

Density was described relatively well by a linear model using alcohol, fixed acidity and sugar, because it’s completely objective and these are likely the most important features that affect it. It’s also worth noting that density is basically an average over all the ingredients in the wine, so a linear model should capture the key variations rather well.
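A density model of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact model fitted earlier: `wine_sim` and `density.m` are hypothetical names, and the data are simulated with an assumed toy relationship (denser with more acid and sugar, lighter with more alcohol) so that the block runs on its own.

```r
# Synthetic stand-in for the wine data (illustrative values only).
set.seed(2)
n <- 200
wine_sim <- data.frame(
  alcohol        = runif(n, 9, 14),
  fixed.acidity  = runif(n, 5, 15),
  residual.sugar = runif(n, 1, 8)
)
# Assumed toy relationship, chosen only to mimic the direction of each effect.
wine_sim$density <- 0.997 -
  0.0008 * wine_sim$alcohol +
  0.0008 * wine_sim$fixed.acidity +
  0.0003 * wine_sim$residual.sugar +
  rnorm(n, sd = 0.0005)

# Linear model of density on the three features named in the text.
density.m <- lm(density ~ alcohol + fixed.acidity + residual.sugar,
                data = wine_sim)

# Direction of each effect (sign of each slope, intercept dropped).
signs <- sign(coef(density.m)[-1])
signs

# Share of variance explained by the model.
summary(density.m)$r.squared
```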

I also used a linear model to look at exp(-pH) and volatile acidity, which shows that when holding citric acid constant, volatile acidity also reduces pH (increases exp(-pH)).


Final Plots and Summary

Plot One

Description One

My main goal is to better understand which input features affect quality and how much they affect the rating. 82.5% of the wines have a quality of 5 or 6, meaning most wines are considered to be of just average quality. The distribution is roughly normal in shape. The lowest quality rating is 3 and the highest is 8.
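A share like the one quoted above can be computed by averaging a logical vector. The ratings below (`quality_sim`) are hypothetical values for illustration only; with the real data one would use `wine$quality`.

```r
# Hypothetical quality ratings (not from the dataset).
quality_sim <- c(5, 6, 5, 7, 5, 6, 4, 6, 5, 8)

# Proportion of wines rated 5 or 6: the mean of TRUE/FALSE values.
share_5_or_6 <- mean(quality_sim %in% c(5, 6))
share_5_or_6

# Counts per rating, and the lowest and highest rating.
table(quality_sim)
range(quality_sim)
```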

Plot Two

Description Two

Among all features in the dataset that affect the quality rating, alcohol has the highest correlation coefficient with quality. The 95% confidence interval of the coefficient is [0.4374, 0.5132]. Thus it makes sense to look at the scatter plot of quality vs. alcohol (quality is a function of the rest of the features). In the plot, I’ve put quality on the x axis so the lines span all quality ratings. Although the correlation is not very strong, it’s still clear that wines tend to have better quality with higher alcohol content. The red line is the median alcohol content at each quality value; the blue line is a linear fit.
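A correlation coefficient with its 95% confidence interval, and a per-rating median like the red line, can be obtained in base R as sketched below. The vectors `alcohol_sim` and `quality_sim` are simulated under an assumed toy relationship purely so the block runs on its own; they are not the real data, and the numbers they produce will differ from those quoted above.

```r
# Synthetic stand-in data (illustrative values only).
set.seed(3)
n <- 300
alcohol_sim <- runif(n, 9, 14)
# Assumed toy relationship: quality rises with alcohol, clipped to 3..8.
quality_sim <- round(pmin(8, pmax(3,
  1 + 0.45 * alcohol_sim + rnorm(n, sd = 0.9))))

# Pearson correlation test (the default method); the confidence
# interval defaults to the 95% level.
ct <- cor.test(alcohol_sim, quality_sim)
ct$estimate
ct$conf.int

# Median alcohol content at each quality rating.
med_by_quality <- tapply(alcohol_sim, quality_sim, median)
med_by_quality
```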

Plot Three

Description Three

Among all features, alcohol and volatile acidity are the two most significant in determining the quality of wines. Generally speaking, a combination of high alcohol content and low volatile acidity makes a better wine. The correlation coefficients between these two features and quality are 0.4762 and -0.3906 respectively. In the plot, the wines with medium to dark blue colors (quality ratings of 7 and 8) are mostly in the top left part of the plot, which has high alcohol content and low volatile acidity. The wines with orange and red colors (quality ratings of 3 and 4) are mostly scattered within the bottom right part of the plot, which has low alcohol content and high volatile acidity. The rest of the wines, rated 5 or 6 and comprising 82.5% of the total, are located somewhere in between. Although the correlation coefficients were not very high, the clear pattern in the plot still motivated me to try out a linear model on this dataset.


Reflection

The red wine dataset contains 12 features on 1,599 different wines. 11 of the 12 are chemical properties of the wines and 1 is a quality rating based on evaluations by at least 3 wine experts. My main goal was to understand the dataset and be able to predict quality from the chemical properties.

After performing exploratory data analysis on this wine quality dataset, I’ve identified the most important features that determine wine quality: alcohol, volatile acidity, sulphates and citric acid. Total sulfur dioxide and chlorides content also play less important roles. However, quality is a very subjective feature, so my attempt at predicting it with a linear model was not very successful, although the analysis still revealed the general pattern. I was particularly frustrated by the fact that none of the correlations stand out as much as those in the diamond dataset did. The fact that 82.5% of wines have a quality of 5 or 6 means that I’m almost trying to predict a boolean variable: if the properties add up, quality is 6; if not, quality is 5. This really limited the performance of my linear model. I also looked at how density and pH vary with their relevant features and gained a better understanding of how these objective quantities change. During the analysis, I struggled to understand the correlation between citric acid and volatile acid. Then I found out about the tendency for citric acid to convert to volatile acid. The linear model for density worked relatively well because everything involved is physically measurable, so it’s much more predictable by nature.

To further study how to predict wine quality, I would try to obtain a larger dataset with more evaluations of every single wine, so that the quality feature is less random. I would also consider changing the way quality is defined. Currently it’s the median of all evaluations, and all evaluations are integers, so many wines have exactly the same quality rating. The fine differences between wines that arise from differences in their other features are rounded off, making quality very hard to predict. Thus I think taking the mean after getting rid of outliers might be a better way to define quality for the purpose of predicting it.
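To illustrate the difference, here is a small sketch comparing the median with a trimmed mean, which is one simple way of “taking the mean after getting rid of outliers”. The scores for the two wines are hypothetical (the dataset does not include individual evaluations), and the trim fraction of 0.2 is an arbitrary choice for the example.

```r
# Hypothetical expert scores for two wines (not from the dataset).
wine_a <- c(2, 6, 6, 7, 7)
wine_b <- c(5, 5, 6, 6, 9)

# The median gives both wines the same integer rating.
med_a <- median(wine_a)
med_b <- median(wine_b)

# A trimmed mean drops the most extreme score at each end
# (trim = 0.2 removes 20% of the values from each side),
# then averages the rest, which separates the two wines.
tm_a <- mean(wine_a, trim = 0.2)
tm_b <- mean(wine_b, trim = 0.2)

c(median_a = med_a, median_b = med_b)
c(trimmed_a = tm_a, trimmed_b = tm_b)
```

Here the two wines share the same median, while the trimmed mean is higher for `wine_a` than for `wine_b`, so the finer difference between them is preserved.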